Introduction to Open Data Science - Course Project

About the project

This is my R Markdown (RMD) output site for the course exercises of the Introduction to Open Data Science MOOC. My repository for this course can be found here

I’m new to version control and GitHub, so let’s see if this becomes a habit. :)

Done:

* Registered on GitHub, installed Git and pushed my first files
* Committed and pushed updated files to the repository


Exercise 2

Read the wrangled data

This data set consists of seven variables and 166 observations. The data were collected in 2014 (Vehkalahti, 2014) from students attending a statistics course in the social sciences. The survey assessed attitudes towards statistics, individual learning styles, and the points each student achieved in the course. In the data used here, the questionnaire items have been aggregated into mean scores (Attitude, deep, stra, surf), i.e. the item sums were divided by the number of items. Further, rows where Points = 0 were removed from the dataset (original N = 183; after removing zeros, N = 166). Find my R file for the data wrangling here.

More information about the dataset at hand can be found here

Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183 Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014 (Introduction to Social Statistics, fall 2014 - in Finnish), international survey of Approaches to Learning, made possible by Teachers’ Academy funding for KV in 2013-2015.
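For reference, the aggregation and filtering described above can be sketched roughly as follows. This is only a sketch: the item names are hypothetical placeholders (see the linked wrangling script for the actual item sets), and `raw` stands for the full survey data.

```r
# Sketch of the wrangling step; item names are hypothetical placeholders.
library(dplyr)

deep_items <- c("D03", "D11", "D19")   # hypothetical deep-learning items
stra_items <- c("ST01", "ST09")        # hypothetical strategic-learning items
surf_items <- c("SU02", "SU10")        # hypothetical surface-learning items

lrn <- raw %>%
  mutate(deep = rowMeans(across(all_of(deep_items))),  # item sum / N of items
         stra = rowMeans(across(all_of(stra_items))),
         surf = rowMeans(across(all_of(surf_items)))) %>%
  select(gender, Age, Attitude, deep, stra, surf, Points) %>%
  filter(Points > 0)                   # drop rows with zero points (183 -> 166)
```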

setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
lrn <- read.csv("lrn.csv")



library(tidyverse)
library(viridis)

#Rename gender variable for plotting purposes
lrn <- 
  lrn %>%
  mutate(gender=recode(gender, "M"="Male", "F"="Female"))




head(lrn)
##   gender Age Attitude     deep  stra     surf Points
## 1 Female  53      3.7 3.583333 3.375 2.583333     25
## 2   Male  55      3.1 2.916667 2.750 3.166667     12
## 3 Female  49      2.5 3.500000 3.625 2.250000     24
## 4   Male  53      3.5 3.500000 3.125 2.250000     10
## 5   Male  49      3.7 3.666667 3.625 2.833333     22
## 6 Female  38      3.8 4.750000 3.625 2.416667     21

Summarising and visualising data

Below you can find a summary of the data produced by describe from the psych library. After this I examined the data visually using ggpairs. I then drew two plots to show the distributions more clearly, and further ran a few geom_smooths for fun. I tried to add the viridis colouring to the ggpairs plot, but it did not get applied to all of the subplots. If you, dear student peer-reviewer, have a solution to this, please share it in the peer review.

First we take a look at the distribution of age by gender, and we see that the males attending this course were older than the females. Second, we look at the distribution of attitudes towards statistics, where we find both the mean (marked with an asterisk) and the median to be higher among males. Third, I was interested in whether age had any association with the points acquired, for which a loess smooth was fitted to the data.

After this I plotted a few scatter plots with geom_smooths (OLS regression lines) to assess relationships among the variables. We find a clear relation between attitude and points earned in the course. Next, the relation of attitude to learning styles was examined; visually, this relationship seems to hold only among males.

Further, the relations of the learning-style variables to acquired course points were assessed. The plots suggest an association of strategic learning with higher course points and of surface learning with lower course points; however, the CI bands are quite wide.

#Using the psych library for extensive summary 
library(psych)
library(GGally)
describe(lrn)
##          vars   n  mean   sd median trimmed  mad   min   max range  skew
## gender*     1 166  1.34 0.47   1.00    1.30 0.00  1.00  2.00  1.00  0.68
## Age         2 166 25.51 7.77  22.00   23.99 2.97 17.00 55.00 38.00  1.89
## Attitude    3 166  3.14 0.73   3.20    3.15 0.74  1.40  5.00  3.60 -0.08
## deep        4 166  3.68 0.55   3.67    3.70 0.62  1.58  4.92  3.33 -0.50
## stra        5 166  3.12 0.77   3.19    3.14 0.83  1.25  5.00  3.75 -0.11
## surf        6 166  2.79 0.53   2.83    2.78 0.62  1.58  4.33  2.75  0.14
## Points      7 166 22.72 5.89  23.00   23.08 5.93  7.00 33.00 26.00 -0.40
##          kurtosis   se
## gender*     -1.54 0.04
## Age          3.24 0.60
## Attitude    -0.48 0.06
## deep         0.66 0.04
## stra        -0.45 0.06
## surf        -0.27 0.04
## Points      -0.26 0.46
ggpairs(lrn, mapping = aes(col=gender, alpha=0.2), lower=list(combo = wrap("facethist", bins =20))) + scale_colour_viridis_d(option = "H")

#Age * sex
lrn %>%
  ggplot(aes(gender, Age, colour=gender)) +
  geom_violin() +
  scale_colour_viridis_d(option="H") +
    geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
  theme_bw() 

#Attitude * Sex
lrn %>%
  ggplot(aes(gender, Attitude, color=gender)) +
  geom_violin() +
  scale_colour_viridis_d(option="H") +
    geom_boxplot(width=0.2, size=0.2, color="black", alpha=0.4, outlier.size = 0) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
  theme_bw()

#Age's relation to points
lrn %>%
  ggplot(aes(Age, Points, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=loess) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

#Attitude * points
lrn %>%
  ggplot(aes(Attitude, Points, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

#Attitude * learning style deep
lrn %>%
  ggplot(aes(Attitude, deep, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

#Attitude * learning strategy
lrn %>%
  ggplot(aes(Attitude, stra, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

#attitude + surface learning style
lrn %>%
  ggplot(aes(Attitude, surf, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

#Learning style deep * points

lrn %>%
  ggplot(aes(deep, Points, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

lrn %>%
  ggplot(aes(stra, Points, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

lrn %>%
  ggplot(aes(surf, Points, color=gender)) +
  geom_point(alpha=0.2) +
  geom_smooth(method=lm) +
  scale_colour_viridis_d(option="H") +
  theme_bw()

Fitting a regression model

For this exercise we were asked to fit a model with three IVs predicting the points earned (the DV of the model), and to remove non-significant IVs from the final model.

I will fit the model (an OLS regression) with attitude, strategic learning and surface learning as IVs, as associations of these variables were observed in the above visualisations.

Below are the summaries of lfit (with all three of the aforementioned IVs) and lfit2 (attitude as the only IV). We find that attitude has quite a strong association with the points acquired (B = 3.40, p < .001). Surface learning is associated negatively with the outcome and strategic learning positively; however, both are far from significant, and neither was significantly associated with Points as a single predictor either. The multiple R-squared is about .02 higher in lfit, so the added predictors account for a tiny bit of additional variance explained. However, given the non-significance of these predictors and that the adjusted R-squared differs by less than .01, we can be happy with attitude alone explaining about 19% of the variance in acquired course points.

lfit <- lm(Points ~ Attitude + surf + stra, data=lrn)

summary(lfit)
## 
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = lrn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#removing NS IV's

lfit2 <- lm(Points ~ Attitude, data=lrn)

summary(lfit2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = lrn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Diagnostic plots for regression model

The residuals vs. fitted plot shows that the assumption of a linear association is reasonable. The Q-Q plot supports the assumption of normally distributed errors, although a few outliers emerge. According to these diagnostics, the assumptions of the model Points ~ Attitude are reasonably valid.

par(mfrow = c(2,2))
plot(lfit2, which=c(1,2,5))


Exercise 3

setwd("C:/Users/rekar/Documents/Road to PhD/Kurssit/Open data science/IODS-project/data")
pormath <- read.csv("pormath.csv")

library(dplyr)
library(tidyverse)
library(GGally)
library(viridis)
library(sjPlot)
library(cowplot)

Data at hand

In this exercise, the data assessed is retrieved from the UCI Machine Learning Repository and was collected by Paulo Cortez. It comprises two data sets on secondary-school students from Portuguese schools, combining both school reports and questionnaires. More information about the data set used in this exercise can be found here. The two data sets concern performance in mathematics and Portuguese, respectively. Only participants present in both data sets are included in the combined data used in this exercise; the total N of observations is thus 370.

colnames(pormath)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

Choosing four variables of interest and assessing their association to high consumption of alcohol

The association of the following variables with high alcohol consumption will be assessed. High alcohol consumption is defined here as a mean score > 2, where the mean is computed from weekday and weekend alcohol consumption (each rated from 1 = very low to 5 = very high), summed and divided by two.
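These variables are already present in the wrangled data; for reference, a sketch consistent with the definition above, assuming dplyr is loaded and that Dalc and Walc are the weekday and weekend consumption items:

```r
pormath <- pormath %>%
  mutate(alc_use  = (Dalc + Walc) / 2,  # mean of weekday and weekend use (1-5)
         high_use = alc_use > 2)        # TRUE when the mean exceeds 2
```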

In the following arguments for choosing these variables, I won’t be citing any research, since that is beyond the scope of the current exercise. However, to present working hypotheses, I briefly argue why I expect certain outcomes. Furthermore, I have not familiarised myself with the socio-cultural or contextual factors that should be taken into account when assessing the following factors.

First, parental cohabitation (Pstatus) is an interesting variable, since co-parenting is often associated with better all-round well-being in children and adolescents. I expect cohabiting parenthood to be associated with lower levels of alcohol consumption. Second, I find it interesting to assess whether attending nursery school has an association with alcohol consumption. As I am not familiar with the Portuguese nursery-school system, I can only take a wild guess and expect that those who attended nursery school (i.e., pre-school) were better prepared to start primary school, and therefore experienced fewer hardships at the very start of their educational path. Third, I assess the association of the number of past class failures with high alcohol consumption. I assume that adolescents lacking the resources (social, emotional or instrumental) to cope with failures may develop self-handicapping strategies, which can turn into a vicious cycle; hence, the higher the number of failures, the higher the probability of high alcohol consumption. Finally, I expect perceived familial relations (famrel) to also be associated with the alcohol consumption outcome: the better the familial relations, the lower the probability of high alcohol consumption.

Graphical and cross-tabulation examination of the associations of the selected variables with alcohol consumption

###Select the variables to a new DF####
pormath_s <- pormath %>%
  select(Pstatus, nursery, failures, famrel, sex, age, high_use)

### Change logical high_use to numerical ####
#pormath_s$high_use <- as.numeric(pormath_s$high_use) # 1 = True, 0 = False
 
### Distributions with the discrete variable (alc_use) ####
p1 <- (pormath %>%
  ggplot(aes(Pstatus, alc_use)) +
  see::geom_violinhalf() +
    geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
    stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
  theme_bw() +
  ylab("Alcohol use") +
  xlab("Parents Apart (A) or Together (T)?")
  )

p2 <- (pormath %>%
  ggplot(aes(nursery, alc_use)) +
  see::geom_violinhalf() +
    geom_boxplot(width=0.2, size=0.2, alpha=0.4) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
    stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
  theme_bw() +
  ylab("Alcohol use") +
  xlab("Attended nursery school?")
)

p3 <- (pormath %>%
  ggplot(aes(failures, alc_use, color=alc_use)) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
    stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
  theme_bw() +
  ylab("Alcohol use") +
  xlab("Number of past Class failures")
)


p4 <- (pormath %>%
  ggplot(aes(famrel, alc_use)) +
    stat_summary(fun=mean, geom="point", size=1, color="black", shape=8) +
    stat_summary(fun.data = mean_se, geom = "errorbar", width=0.1, size=0.5) +
  theme_bw() +
  ylab("Alcohol use") +
  xlab("Familial relations \n (1 = very bad, 5 = excellent)")
)

plot_grid(p1, p2, p3, p4)

### Cross-tabulations ####


tab_xtab(var.row = pormath_s$Pstatus, var.col = pormath_s$high_use, title = "Parents Cohabiting?", show.row.prc = TRUE)
Parents Cohabiting?

            high_use
Pstatus     FALSE          TRUE           Total
A           26 (68.4 %)    12 (31.6 %)    38 (100 %)
T           233 (70.2 %)   99 (29.8 %)    332 (100 %)
Total       259 (70 %)     111 (30 %)     370 (100 %)

χ²=0.001 · df=1 · φ=0.012 · p=0.970
tab_xtab(var.row = pormath_s$nursery, var.col = pormath_s$high_use, title = "Attended to Nursery School?", show.row.prc = TRUE)
Attended to Nursery School?

            high_use
nursery     FALSE          TRUE           Total
no          46 (63.9 %)    26 (36.1 %)    72 (100 %)
yes         213 (71.5 %)   85 (28.5 %)    298 (100 %)
Total       259 (70 %)     111 (30 %)     370 (100 %)

χ²=1.249 · df=1 · φ=0.066 · p=0.264
tab_xtab(var.row = pormath_s$failures, var.col = pormath_s$high_use, title = "Number of past Class failures", show.row.prc = TRUE)
Number of past Class failures

            high_use
failures    FALSE          TRUE           Total
0           238 (73.2 %)   87 (26.8 %)    325 (100 %)
1           12 (50 %)      12 (50 %)      24 (100 %)
2           8 (47.1 %)     9 (52.9 %)     17 (100 %)
3           1 (25 %)       3 (75 %)       4 (100 %)
Total       259 (70 %)     111 (30 %)     370 (100 %)

χ²=14.304 · df=3 · Cramer’s V=0.197 · Fisher’s p=0.003
tab_xtab(var.row = pormath_s$famrel, var.col = pormath_s$high_use, title = "Familial relations", show.row.prc = TRUE)
Familial relations

            high_use
famrel      FALSE          TRUE           Total
1           6 (75 %)       2 (25 %)       8 (100 %)
2           9 (50 %)       9 (50 %)       18 (100 %)
3           39 (60.9 %)    25 (39.1 %)    64 (100 %)
4           128 (71.1 %)   52 (28.9 %)    180 (100 %)
5           77 (77 %)      23 (23 %)      100 (100 %)
Total       259 (70 %)     111 (30 %)     370 (100 %)

χ²=8.466 · df=4 · Cramer’s V=0.151 · Fisher’s p=0.068

Fitting a Logistic Regression

Of the chosen variables, only two had a statistically significant relationship with high consumption of alcohol: the number of past class failures predicted it positively (95% CI for OR = 1.31–2.86) and family relations negatively (95% CI for OR = 0.60–0.98). The other two variables had no statistically significant relationship with high consumption of alcohol.

m1 <- glm(high_use ~ Pstatus + nursery + failures + famrel, family="binomial", data=pormath_s)
jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45 
## 
## Standard errors: MLE
## ------------------------------------------------
##                      Est.   S.E.   z val.      p
## ----------------- ------- ------ -------- ------
## (Intercept)          0.39   0.64     0.61   0.54
## PstatusT            -0.10   0.38    -0.27   0.79
## nurseryyes          -0.31   0.29    -1.08   0.28
## failures             0.65   0.20     3.29   0.00
## famrel              -0.27   0.12    -2.14   0.03
## ------------------------------------------------
OR1 <- coef(m1) %>% exp
CI1 <- confint(m1) %>% exp
cbind(OR1, CI1)
##                   OR1     2.5 %    97.5 %
## (Intercept) 1.4764105 0.4176121 5.1647164
## PstatusT    0.9038712 0.4366755 1.9686682
## nurseryyes  0.7345732 0.4215456 1.3009277
## failures    1.9163856 1.3101788 2.8641095
## famrel      0.7654375 0.5985101 0.9775722

Cross-tabulated predictions

Here I further examine the predictive power of the model fitted above. The predictive power is fair: about 30% of the outcomes were predicted wrong.

# predict() the probability of high_use
probabilities <- predict(m1, type = "response")

# add the predicted probabilities to 'alc'
pormath_s <- mutate(pormath_s, probability = probabilities)

# use the probabilities to make a prediction of high_use
pormath_s <- mutate(pormath_s, prediction = probability > 0.5)


# tabulate the target variable versus the predictions
table(high_use = pormath_s$high_use, prediction = pormath_s$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     99   12
ggplot(pormath_s, aes(probability, high_use, col=prediction)) +
  geom_point() 

table(high_use = pormath_s$high_use, prediction = pormath_s$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67027027 0.02972973 0.70000000
##    TRUE  0.26756757 0.03243243 0.30000000
##    Sum   0.93783784 0.06216216 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973

Cross-Validating the model

The ten-fold cross-validation shows that the predictive performance of this model on the held-out test folds (about 30% wrong) is close to its performance on the whole training data (also about 30% wrong).

loss_func(class = pormath_s$high_use, prob = pormath_s$probability)
## [1] 0.2972973
cv <- boot::cv.glm(data = pormath_s, cost = loss_func, glmfit = m1, K = 10)

cv$delta
## [1] 0.3027027 0.3024324

Trying to find a better model for predictive power

Here I compare two further models to the M1 tested above. Before evaluating predictive power, I examine the models by their BIC and pseudo-R² values. By both measures, the third model below (M3) is the best of the three.

Further, the predictive power of M3 (cross-validated error rate of about 22%) exceeds that of the first model (M1, about 30% error rate) tested above, and moreover exceeds that of the model introduced in the course DataCamp exercises (about 26% error rate).

According to this model (M3, last summary below), males had a higher probability of high alcohol consumption than their female counterparts (95% CI for OR = 1.66–4.76). Second, adolescents who went out more with friends were also more likely to exceed our cut-off point for high alcohol consumption (95% CI for OR = 1.68–2.77). Third, adolescents living in urban areas were less likely to do so than their counterparts living in rural areas (95% CI for OR = 0.28–0.97). Fourth, adolescents who had failed classes tended to be more likely to do so, although the interval crosses 1 (95% CI for OR = 0.95–2.37); it is also worth noting that predicting alcohol consumption from past failures in these data rests on very few observations (see the class counts in the cross-tabulations above). Fifth, adolescents with good familial relations were less likely to do so (95% CI for OR = 0.50–0.89). Finally, adolescents with more absences were also more likely to consume more alcohol, although the OR per absence is modest (95% CI for OR = 1.04–1.13).

jtools::summ(m1)
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(4) = 18.16, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.07
## Pseudo-R² (McFadden) = 0.04
## AIC = 443.88, BIC = 463.45 
## 
## Standard errors: MLE
## ------------------------------------------------
##                      Est.   S.E.   z val.      p
## ----------------- ------- ------ -------- ------
## (Intercept)          0.39   0.64     0.61   0.54
## PstatusT            -0.10   0.38    -0.27   0.79
## nurseryyes          -0.31   0.29    -1.08   0.28
## failures             0.65   0.20     3.29   0.00
## famrel              -0.27   0.12    -2.14   0.03
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel + freetime + goout + internet, family="binomial", data=pormath)) #freetime and internet access not significant; removing them.
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(7) = 83.08, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.29
## Pseudo-R² (McFadden) = 0.18
## AIC = 384.96, BIC = 416.27 
## 
## Standard errors: MLE
## ------------------------------------------------
##                      Est.   S.E.   z val.      p
## ----------------- ------- ------ -------- ------
## (Intercept)         -2.23   0.77    -2.89   0.00
## failures             0.45   0.23     1.99   0.05
## sexM                 0.89   0.26     3.38   0.00
## addressU            -0.70   0.32    -2.22   0.03
## famrel              -0.43   0.14    -3.00   0.00
## freetime             0.11   0.14     0.75   0.46
## goout                0.76   0.13     5.90   0.00
## internetyes          0.26   0.38     0.69   0.49
## ------------------------------------------------
jtools::summ(glm(high_use ~ failures + sex + address + famrel  + goout + absences, family="binomial", data=pormath))
## MODEL INFO:
## Observations: 370
## Dependent Variable: high_use
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(6) = 94.64, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.32
## Pseudo-R² (McFadden) = 0.21
## AIC = 371.39, BIC = 398.79 
## 
## Standard errors: MLE
## ------------------------------------------------
##                      Est.   S.E.   z val.      p
## ----------------- ------- ------ -------- ------
## (Intercept)         -2.27   0.71    -3.22   0.00
## failures             0.40   0.23     1.73   0.08
## sexM                 1.03   0.27     3.82   0.00
## addressU            -0.65   0.31    -2.07   0.04
## famrel              -0.40   0.15    -2.79   0.01
## goout                0.76   0.13     6.02   0.00
## absences             0.08   0.02     3.54   0.00
## ------------------------------------------------
m3 <- glm(high_use ~ failures + sex + address + famrel  + goout + absences, family="binomial", data=pormath)
exp(cbind(OR = coef(m3), confint(m3)))
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1030586 0.0248787 0.4001095
## failures    1.4913178 0.9538249 2.3728914
## sexM        2.7874285 1.6580283 4.7602275
## addressU    0.5238851 0.2836255 0.9691226
## famrel      0.6672113 0.5004071 0.8856192
## goout       2.1448371 1.6844140 2.7725681
## absences    1.0816926 1.0365062 1.1324925
# predict() the probability of high_use
probabilities3 <- predict(m3, type = "response")

# add the predicted probabilities 
pormath <- mutate(pormath, probability = probabilities3)

# use the probabilities to make a prediction of high_use
pormath <- mutate(pormath, prediction = probability > 0.5)


# tabulate the target variable versus the predictions
table(high_use = pormath$high_use, prediction = pormath$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     56   55
ggplot(pormath, aes(probability, high_use, col=prediction)) +
  geom_point() 

table(high_use = pormath$high_use, prediction = pormath$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64054054 0.05945946 0.70000000
##    TRUE  0.15135135 0.14864865 0.30000000
##    Sum   0.79189189 0.20810811 1.00000000
loss_func1 <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func1 to compute the proportion of wrong predictions in the (training) data
loss_func1(class = pormath$high_use, prob = pormath$probability)
## [1] 0.2108108
#cross-validate
cv2 <- boot::cv.glm(data = pormath, cost = loss_func, glmfit = m3, K = 10)
#print error-rate of cross-validation
cv2$delta
## [1] 0.2216216 0.2254054

Exercise 4

In this exercise the data used is the Boston dataset included in the MASS package. It comprises 506 observations of 14 variables. The original rationale for the data was to study the effect of air quality (NOx levels) on housing prices by area (see Harrison & Rubinfeld, 1978). The 506 observations are census tracts, not districts as I first supposed from the structure of the data. Details of the variables can be found here

Further, the rationale of this exercise is to find an appropriate number of clusters.

Below you will find a summary of the variables, an overview of the data, density plots of the variables and, finally, the correlation plot of the data.

From these figures we can see that accessibility to radial highways (rad) correlates strongly (r = .91) with the full-value property-tax rate (tax). NOx levels are negatively correlated with distance to the Boston employment centres (r = -.77); in other words, the further from these centres, the lower the NOx levels. Further, it seems that the older the buildings, the higher the NOx levels (r = .73), which I assume might correspond to city-centre areas. Eyeballing these figures, it could be argued that the data might split into two clusters, corresponding to the shapes, peaks and polarisation seen in the density plots.

library(tidyverse)
library(corrplot)
library(MASS)
library(viridis)
library(GGally)
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

Boston %>%
  gather(key=var_name, value = value) %>%
  ggplot(aes(x=value)) +
  geom_density() +
  facet_wrap(~var_name, scales="free") + theme_bw()

cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot(cor_matrix, method="color", type = "lower", cl.pos="b", tl.pos="d", tl.cex=0.7, cl.cex = 0.7, col = viridis(n=100), tl.col = "black", addCoef.col="black", number.cex=0.5)

First, the data is standardized: each variable is transformed as (x - mean(x)) / sd(x). Then a categorical variable is created from the crim variable using its quantiles as break points. Finally, the old crim variable is removed and the categorised crime variable is added to the data frame in its place.
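As a quick sanity check on the formula, `scale()` applied to a single vector reproduces the manual standardization; a minimal sketch on toy data:

```r
# scale() standardizes each column as (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
automatic <- as.vector(scale(x))   # drop the matrix attributes scale() adds
all.equal(manual, automatic)       # TRUE
```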

boston_scaled <- scale(Boston)

boston_scaled <- as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks=bins, include.lowest=TRUE, labels=c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Next the data is divided into train (80%) and test (20%) sets.

n <- nrow(boston_scaled)  # 506 observations
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
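The split above is random, so the exact rows in train and test change on every knit; seeding the random number generator makes it reproducible. A self-contained sketch of the seeding idea (the seed value 123 is my own choice, and the crime categorization step is omitted here for brevity):

```r
library(MASS)                       # provides the Boston data
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)                       # fix the RNG so the 80/20 split is reproducible
n <- nrow(boston_scaled)            # 506 observations; avoids hard-coding the count
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
c(nrow(train), nrow(test))          # 404 102
```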

Next I run a linear discriminant analysis (LDA). The model identifies the high crime rate class well, although a few med_high observations are misclassified as high. Predictions for the low crime category are also good, but in the middle categories the classification could be better. Rad, i.e., accessibility to radial highways, is clearly the strongest predictor in this model, with NOx and the proportion of residential land having the next largest estimates.

# linear discriminant analysis; lda() comes from the MASS package
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2400990 0.2425743 0.2549505 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       0.94550337 -0.9113049 -0.08661679 -0.8799680  0.4774434 -0.8814792
## med_low  -0.09206328 -0.3150893 -0.02879709 -0.5722726 -0.1261929 -0.3847239
## med_high -0.37173443  0.1214695  0.16959035  0.2854773  0.1642232  0.3496570
## high     -0.48724019  1.0170891 -0.08120770  1.0891473 -0.4468777  0.8297720
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8261448 -0.6882537 -0.7436993 -0.48755665  0.3875502 -0.77903790
## med_low   0.3697808 -0.5437961 -0.4849496 -0.08565135  0.3279776 -0.14675853
## med_high -0.3263681 -0.3900591 -0.2965035 -0.16328022  0.1315624 -0.06035831
## high     -0.8397379  1.6384176  1.5142626  0.78111358 -0.7395864  0.89099707
##                 medv
## low       0.56213606
## med_low  -0.02049302
## med_high  0.22054966
## high     -0.72667349
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.083996778  0.663842710 -0.87404743
## indus    0.007118397 -0.257469431  0.36162046
## chas    -0.087003704 -0.053228386  0.06822034
## nox      0.497021117 -0.804277905 -1.37469582
## rm      -0.089507406 -0.073338199 -0.16005020
## age      0.245543799 -0.270059998 -0.16872710
## dis      0.026591819 -0.268187693  0.15876190
## rad      2.903080486  1.140265129  0.09509299
## tax      0.043333306 -0.127764129  0.37883694
## ptratio  0.118130448 -0.142815306 -0.38903761
## black   -0.123539774 -0.008512562  0.10116545
## lstat    0.267979096 -0.280132977  0.38803245
## medv     0.215683028 -0.508477333 -0.30261083
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9504 0.0367 0.0130
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "blue", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.2)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       4        2    0
##   med_low    7      16        6    0
##   med_high   0       6       22    0
##   high       0       0        0   24
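The overall accuracy can be read off the cross-tabulation by dividing the diagonal by the total; a sketch using the counts printed above (note that the counts change with each random split):

```r
# confusion matrix copied from the cross-tabulation output above
confusion <- matrix(c(15,  4,  2,  0,
                       7, 16,  6,  0,
                       0,  6, 22,  0,
                       0,  0,  0, 24),
                    nrow = 4, byrow = TRUE)
sum(diag(confusion)) / sum(confusion)   # 77/102, about 0.755
```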

Next I reload the Boston data set, scale it again, and calculate the distances between the observations.

data("Boston")

boston_scaled_k <- scale(Boston)

boston_scaled_k <- as.data.frame(boston_scaled_k)

summary(dist(boston_scaled_k))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
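By default `dist()` computes Euclidean distances, which is what is summarized above. A quick check on a toy example:

```r
# Euclidean distance between the points (0, 0) and (3, 4)
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))   # 5
```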

Next the data is clustered with K-means. The total within-cluster sum of squares drops most sharply at two clusters, so a two-cluster solution is tested next.

kmax <- 10
set.seed(123)
totws <- sapply(1:kmax, function(k){kmeans(boston_scaled_k, k)$tot.withinss})
qplot(x = 1:kmax, y=totws, geom='line') + theme_bw() +
  ylab("Total within sum of squares")

The two-cluster solution is plotted below. As anticipated, the clusters align with the same variables that stood out in the density plots: the proportion of non-retail business acres per town (indus) and the property tax rate (tax) separate very clearly, and NOx levels are also nicely divided between the two clusters. The density plots are shown here as well for a scroll-free comparison.

kme <- kmeans(boston_scaled_k, centers = 2)
pairs(boston_scaled_k, col=kme$cluster)

boston_scaled_k %>%
  gather(key=var_name, value = value) %>%
  ggplot(aes(x=value)) +
  geom_density() +
  facet_wrap(~var_name, scales="free") + theme_bw()


Exercise 5

In this exercise, principal component analysis (PCA) and multiple correspondence analysis (MCA) are practiced with two different data sets.

PCA and HDI-data

For the PCA, the data used is the Human Development Index (HDI) data of the UNDP. The data set used in this exercise has observations of eight variables, listed below, from 155 countries. Further information on the data set and the variables can be found here.

As the variables used in this exercise have been wrangled further, an explanation of each variable follows:

  • Edu2.FM is the ratio of females to males with at least secondary education
  • Labo.FM is the ratio of females to males in the labor force
  • Edu.Exp is the expected years of education
  • Life.Exp is life expectancy at birth
  • GNI is Gross National Income per capita
  • Mat.Mor is the maternal mortality rate per 100 000 births
  • Ado.Birth is the adolescent birth rate per 1 000 women aged 15–19
  • Parli.F is the percentage of females in parliament
h <- read.csv('./data/human.csv', header= T, row.names = 1)
library(GGally)
library(corrplot)
library(tidyverse)
library(dplyr)
library(viridis)

str(h)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Below you can find summaries and figures of the data at hand. We find large discrepancies in the data set, i.e., large inequality between countries. From the summaries we see that, on average, more males than females complete secondary education, and males are also more often in the labor force. The expected years of education vary from 5.4 to 20.2, with both mean and median around 13 years. Life expectancy ranges from 49 to 83.5 years, maternal mortality from 1 to 1100 deaths per 100 000 births, and the adolescent birth rate from 0.6 to 204.8 births per 1 000 women. Finally, there are parliaments with no females at all, while others have 57.5% female representatives.

From the correlations, we see that the maternal mortality rate is strongly related to the adolescent birth rate (r = .76). Expected years of education correlate positively with life expectancy (r = .79) and GNI (r = .62), and negatively with both maternal mortality (r = -.74) and the adolescent birth rate (r = -.70), indicating that long expected education and high life expectancy are properties of countries with higher GNI.
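The correlations quoted above can also be pulled out programmatically rather than read from the plot. A sketch of a small helper that lists the strongest pairwise correlations, demonstrated here on the built-in mtcars data (for the HDI data, pass `h` instead):

```r
# List the n strongest pairwise correlations in a numeric data frame
strongest_cors <- function(df, n = 5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA            # keep each pair only once
  idx <- order(abs(cm), decreasing = TRUE, na.last = NA)[1:n]
  data.frame(var1 = rownames(cm)[row(cm)[idx]],
             var2 = colnames(cm)[col(cm)[idx]],
             r    = round(cm[idx], 2))
}
strongest_cors(mtcars, n = 3)
```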

summary(h)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
my_fn <- function(data, mapping, method="p", use="pairwise", ...){

              # grab data
              x <- eval_data_col(data, mapping$x)
              y <- eval_data_col(data, mapping$y)

              # calculate correlation
              corr <- cor(x, y, method=method, use=use)

              # calculate colour based on correlation value
              # Here I have set a correlation of minus one to blue, 
              # zero to white, and one to red 
              # Change this to suit: possibly extend to add as an argument of `my_fn`
              colFn <- colorRampPalette(c("blue", "white", "red"), interpolate ='spline')
              fill <- colFn(100)[findInterval(corr, seq(-1, 1, length=100))]

              ggally_cor(data = data, mapping = mapping, ...) + 
                theme_void() +
                theme(panel.background = element_rect(fill=fill))
            } #this function was written by user20650 from stackoverflow, #https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values


ggpairs(h, 
        upper = list(continuous = my_fn), lower=list(combo=wrap("facethist", binwidth=20, size=1)))

h %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x") +
  theme_bw() 

First we run PCA on the non-scaled data, which naturally doesn't make much sense, since the scale of GNI is orders of magnitude larger than that of the other variables (see the graph below). The result would imply that GNI alone explains essentially 100% of the variance in the data. It is therefore reasonable to scale the data before doing PCA, which follows.

pca_h <- prcomp(h)


s <- summary(pca_h)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
pca_pr1 <- round(100*s$importance[2,], digits = 1) 

pca_pr1
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc_lab <- paste0(names(pca_pr1), " (", pca_pr1, "%)")


biplot(pca_h, choices=1:2, cex=c(0.8,1), col=c("slategray3","royalblue3"), xlab=pc_lab[1], ylab=pc_lab[2])

As the unscaled solution is dominated by GNI, the data is scaled before re-running the PCA.

On the biplot below, the arrows correspond to the original variables and the angles between them to their correlations. We see that Edu.Exp, Life.Exp, Edu2.FM and GNI are positively correlated with each other, and negatively correlated with Mat.Mor and Ado.Birth. The first component comprises these variables and explains 53.6% of the variance; the second component comprises the female-to-male ratios in the labor force and parliament, explaining 16.2% of the variance.

h_scaled <- scale(h)

pca_h_s <- prcomp(h_scaled)


s2 <- summary(pca_h_s)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
pca_pr2 <- round(100*s2$importance[2,], digits = 1) 

pca_pr2
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")


biplot(pca_h_s, choices=1:2, cex=c(0.8, 1), col=c("slategray3","royalblue3"), xlab=pc_lab2[1], ylab=pc_lab2[2])
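As a side note, `prcomp` can do the standardization internally via its `scale.` argument, which is equivalent to calling `scale()` first; a quick check on a built-in data set:

```r
# prcomp(x, scale. = TRUE) equals prcomp(scale(x)), up to the arbitrary sign of each PC
p1 <- prcomp(scale(mtcars))
p2 <- prcomp(mtcars, scale. = TRUE)
all.equal(p1$sdev, p2$sdev)                       # TRUE
all.equal(abs(p1$rotation), abs(p2$rotation))     # TRUE
```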

MCA and Tea-data

The tea data set from the FactoMineR library has 300 observations of 36 variables and concerns tea-drinking preferences. It is worth noting that there is not much documentation available on this data set, so some of my assumptions are pure guesses at best.

First we do a MCA on the whole data set, excluding the age variable as it is not a factor. As the resulting plot doesn't make much sense, I chose a few variables that I assumed would correspond to different dimensions.

I chose the following variables:

  • sport (if the individual does sports)
  • diuretic (if the tea is diuretic; I assume that people doing sports would avoid diuretic teas, or maybe take advantage of them in the case of combat-sports weigh-ins)
  • frequency of drinking tea
  • where they buy their teas
  • how (using teabags, unpackaged or both)
  • sugar (drinking with or without sugar)
  • slimming
  • relaxing
  • effect on health
  • feminine
  • sex
  • age group

With these factors, the explained variance on dimension 1 is enhanced. On dimension one, the factors correspond to being young or old, sportsman or not, and the tea being diuretic or not; in a sense, a correspondence to lifestyles can be seen here. The second dimension separates those who prefer quality or speciality teas from those who prefer big brands: the ones who buy unpackaged tea from tea shops versus those who buy teabags from chain stores.

Further, with the plotellipses function it is possible to draw confidence ellipses. These confirm that the type of tea (how) and where it is bought form the most clearly separated groups.

library(FactoMineR)
data("tea")
str(tea) #drop age, since it is not a factor
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#ggpairs(tea)


tea_s <- subset(tea, select= -age) #drop age from df

mca <- MCA(tea_s, graph=F)
summary(mca)
## 
## Call:
## MCA(X = tea_s, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.090   0.082   0.070   0.063   0.056   0.053   0.050
## % of var.              5.838   5.292   4.551   4.057   3.616   3.465   3.272
## Cumulative % of var.   5.838  11.130  15.681  19.738  23.354  26.819  30.091
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.048   0.047   0.044   0.041   0.040   0.039   0.037
## % of var.              3.090   3.053   2.834   2.643   2.623   2.531   2.388
## Cumulative % of var.  33.181  36.234  39.068  41.711  44.334  46.865  49.252
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.035   0.034   0.032   0.031   0.031   0.030
## % of var.              2.302   2.275   2.172   2.085   2.013   2.011   1.915
## Cumulative % of var.  51.554  53.829  56.000  58.086  60.099  62.110  64.025
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.028   0.027   0.026   0.025   0.025   0.024   0.024
## % of var.              1.847   1.740   1.686   1.638   1.609   1.571   1.524
## Cumulative % of var.  65.872  67.611  69.297  70.935  72.544  74.115  75.639
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.021   0.020   0.020   0.019   0.019
## % of var.              1.459   1.425   1.378   1.322   1.281   1.241   1.222
## Cumulative % of var.  77.099  78.523  79.901  81.223  82.504  83.745  84.967
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.018   0.017   0.017   0.016   0.015   0.015   0.014
## % of var.              1.152   1.092   1.072   1.019   0.993   0.950   0.924
## Cumulative % of var.  86.119  87.211  88.283  89.301  90.294  91.244  92.169
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.014   0.013   0.012   0.011   0.011   0.010   0.010
## % of var.              0.891   0.833   0.792   0.729   0.716   0.666   0.660
## Cumulative % of var.  93.060  93.893  94.684  95.414  96.130  96.796  97.456
##                       Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.009   0.009   0.008   0.007   0.006
## % of var.              0.605   0.584   0.519   0.447   0.390
## Cumulative % of var.  98.060  98.644  99.163  99.610 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.580  1.246  0.174 |  0.155  0.098  0.012 |  0.052  0.013
## 2             | -0.376  0.522  0.108 |  0.293  0.350  0.066 | -0.164  0.127
## 3             |  0.083  0.026  0.004 | -0.155  0.099  0.015 |  0.122  0.071
## 4             | -0.569  1.196  0.236 | -0.273  0.304  0.054 | -0.019  0.002
## 5             | -0.145  0.078  0.020 | -0.142  0.083  0.019 |  0.002  0.000
## 6             | -0.676  1.693  0.272 | -0.284  0.330  0.048 | -0.021  0.002
## 7             | -0.191  0.135  0.027 |  0.020  0.002  0.000 |  0.141  0.095
## 8             | -0.043  0.007  0.001 |  0.108  0.047  0.009 | -0.089  0.038
## 9             | -0.027  0.003  0.000 |  0.267  0.291  0.049 |  0.341  0.553
## 10            |  0.205  0.155  0.028 |  0.366  0.546  0.089 |  0.281  0.374
##                 cos2  
## 1              0.001 |
## 2              0.021 |
## 3              0.009 |
## 4              0.000 |
## 5              0.000 |
## 6              0.000 |
## 7              0.015 |
## 8              0.006 |
## 9              0.080 |
## 10             0.052 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.182  0.504  0.031  3.022 |  0.020  0.007  0.000  0.330 |
## Not.breakfast | -0.168  0.465  0.031 -3.022 | -0.018  0.006  0.000 -0.330 |
## Not.tea time  | -0.556  4.286  0.240 -8.468 |  0.004  0.000  0.000  0.065 |
## tea time      |  0.431  3.322  0.240  8.468 | -0.003  0.000  0.000 -0.065 |
## evening       |  0.276  0.830  0.040  3.452 | -0.409  2.006  0.087 -5.109 |
## Not.evening   | -0.144  0.434  0.040 -3.452 |  0.214  1.049  0.087  5.109 |
## lunch         |  0.601  1.678  0.062  4.306 | -0.408  0.854  0.029 -2.924 |
## Not.lunch     | -0.103  0.288  0.062 -4.306 |  0.070  0.147  0.029  2.924 |
## dinner        | -1.105  2.709  0.092 -5.240 | -0.081  0.016  0.000 -0.386 |
## Not.dinner    |  0.083  0.204  0.092  5.240 |  0.006  0.001  0.000  0.386 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.107  0.225  0.011 -1.784 |
## Not.breakfast  0.099  0.208  0.011  1.784 |
## Not.tea time   0.062  0.069  0.003  0.950 |
## tea time      -0.048  0.054  0.003 -0.950 |
## evening        0.344  1.653  0.062  4.301 |
## Not.evening   -0.180  0.864  0.062 -4.301 |
## lunch          0.240  0.343  0.010  1.719 |
## Not.lunch     -0.041  0.059  0.010 -1.719 |
## dinner         0.796  1.805  0.048  3.777 |
## Not.dinner    -0.060  0.136  0.048 -3.777 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.031 0.000 0.011 |
## tea.time      | 0.240 0.000 0.003 |
## evening       | 0.040 0.087 0.062 |
## lunch         | 0.062 0.029 0.010 |
## dinner        | 0.092 0.000 0.048 |
## always        | 0.056 0.035 0.007 |
## home          | 0.016 0.002 0.030 |
## work          | 0.075 0.020 0.022 |
## tearoom       | 0.321 0.019 0.031 |
## friends       | 0.186 0.061 0.030 |
plot(mca, invisible=c("ind"), habillage="quali", graph.type = "ggplot") 

 plotellipses(mca, graph.type = c("ggplot"))

keep_b <- c("Sport", "diuretic", "frequency", "where", "how", "sugar", "slimming", "relaxing", "effect.on.health", "feminine", "sex", "age_Q")

tea_b <- dplyr::select(tea, all_of(keep_b))

mca_b <- MCA(tea_b, graph=F)
summary(mca_b)
## 
## Call:
## MCA(X = tea_b, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.166   0.154   0.138   0.116   0.106   0.098   0.092
## % of var.             10.473   9.741   8.710   7.331   6.689   6.188   5.788
## Cumulative % of var.  10.473  20.214  28.923  36.255  42.944  49.132  54.920
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.086   0.077   0.076   0.069   0.065   0.064   0.058
## % of var.              5.451   4.870   4.819   4.385   4.099   4.042   3.678
## Cumulative % of var.  60.372  65.242  70.061  74.445  78.545  82.587  86.265
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19
## Variance               0.055   0.052   0.042   0.038   0.031
## % of var.              3.491   3.267   2.645   2.385   1.946
## Cumulative % of var.  89.757  93.024  95.669  98.054 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    | -0.571  0.655  0.224 |  0.093  0.019  0.006 |  0.055
## 2                    |  0.161  0.052  0.023 |  0.023  0.001  0.000 |  0.159
## 3                    |  0.248  0.124  0.065 | -0.165  0.058  0.028 |  0.026
## 4                    | -0.639  0.822  0.379 | -0.317  0.217  0.093 |  0.069
## 5                    |  0.013  0.000  0.000 | -0.004  0.000  0.000 |  0.115
## 6                    | -0.559  0.627  0.312 | -0.232  0.117  0.054 | -0.036
## 7                    | -0.321  0.207  0.056 |  0.094  0.019  0.005 | -0.083
## 8                    | -0.060  0.007  0.002 | -0.380  0.312  0.088 |  0.147
## 9                    | -0.084  0.014  0.004 |  0.297  0.191  0.055 | -0.774
## 10                   |  0.254  0.129  0.040 |  0.293  0.186  0.053 | -0.612
##                         ctr   cos2  
## 1                     0.007  0.002 |
## 2                     0.061  0.023 |
## 3                     0.002  0.001 |
## 4                     0.011  0.004 |
## 5                     0.032  0.013 |
## 6                     0.003  0.001 |
## 7                     0.017  0.004 |
## 8                     0.053  0.013 |
## 9                     1.448  0.375 |
## 10                    0.906  0.233 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## Not.sportsman        |   0.406   3.338   0.111   5.769 |  -0.066   0.095
## sportsman            |  -0.274   2.256   0.111  -5.769 |   0.045   0.064
## diuretic             |   0.406   4.808   0.228   8.253 |   0.039   0.048
## Not.diuretic         |  -0.561   6.640   0.228  -8.253 |  -0.054   0.066
## 1/day                |  -0.367   2.146   0.062  -4.323 |   0.086   0.126
## 1 to 2/week          |  -0.698   3.595   0.084  -5.007 |  -0.317   0.795
## +2/day               |   0.466   4.621   0.159   6.905 |  -0.092   0.194
## 3 to 6/week          |   0.189   0.203   0.005   1.168 |   0.514   1.618
## chain store          |  -0.221   1.573   0.087  -5.099 |  -0.449   6.958
## chain store+tea shop |   0.497   3.233   0.087   5.098 |   0.367   1.893
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## Not.sportsman          0.003  -0.940 |   0.223   1.214   0.034   3.172 |
## sportsman              0.003   0.940 |  -0.151   0.820   0.034  -3.172 |
## diuretic               0.002   0.794 |   0.145   0.741   0.029   2.955 |
## Not.diuretic           0.002  -0.794 |  -0.201   1.024   0.029  -2.955 |
## 1/day                  0.003   1.010 |   0.091   0.158   0.004   1.069 |
## 1 to 2/week            0.017  -2.270 |   0.808   5.793   0.112   5.796 |
## +2/day                 0.006  -1.364 |  -0.289   2.140   0.061  -4.285 |
## 3 to 6/week            0.034   3.177 |  -0.220   0.331   0.006  -1.358 |
## chain store            0.358 -10.342 |   0.299   3.461   0.159   6.898 |
## chain store+tea shop   0.047   3.762 |  -1.114  19.493   0.436 -11.417 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Sport                | 0.111 0.003 0.034 |
## diuretic             | 0.228 0.002 0.029 |
## frequency            | 0.210 0.051 0.139 |
## where                | 0.097 0.531 0.476 |
## how                  | 0.012 0.516 0.596 |
## sugar                | 0.288 0.020 0.002 |
## slimming             | 0.104 0.028 0.146 |
## relaxing             | 0.039 0.117 0.011 |
## effect.on.health     | 0.001 0.012 0.065 |
## feminine             | 0.252 0.103 0.038 |
plot(mca_b, invisible=c("ind"), habillage="quali", graph.type="ggplot")

 plotellipses(mca_b, graph.type = c("ggplot"))